Detecting multiword phrases in mathematical text corpora

Author

  • Winfried Gödert
Abstract

We present an approach for detecting multiword phrases in mathematical text corpora. The method is based on characteristic features of mathematical terminology. It uses a software tool named Lingo, which identifies words by means of previously defined dictionaries for specific word classes such as adjectives, personal names, or nouns. The detection of multiword groups is done algorithmically. We discuss possible advantages of the method for indexing and information retrieval, and conclusions for applying dictionary-based methods of automatic indexing instead of stemming procedures.

Problems and goals

We start by discussing an example. Consider the text of an abstract for a paper with mathematical content:

"We study some rigidity properties for locally symmetrical Finsler manifolds. We obtain the local equivalent characterization for a Finsler manifold to be locally symmetric and prove that any locally symmetrical Finsler manifold with nonzero flag curvature must be Riemannian. We also generalize a rigidity result due to Akbar Zadeh."

When looking for methods that generate index terms automatically and that offer good representation properties and equally good discrimination properties for retrieval purposes, the following question may be of interest: Which of the words are part of a multiword phrase representing a mathematical concept or a proper entity of mathematical terminology? Intellectual analysis can identify the following phrases:

  • rigidity properties
  • locally symmetrical Finsler manifold(s)
  • local equivalent characterization
  • nonzero flag curvature
  • rigidity result

We have cited the respective longest sequences with a proper meaning. These sequences can contain shorter ones, which normally have a more generic, superordinate meaning. Next, we ask the following questions:

  • Is it possible to identify sequences by applying automatic techniques?
  • Is it possible to identify as many sequences of words as possible that can be seen as representations of mathematical concepts?
  • Is it possible to avoid identification of almost all sequences

1 Abstract taken from the database Zentralblatt MATH (http://www.zentralblatt-math.org/zmath/) with permission of the editorial staff.
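The dictionary-based idea described in the abstract can be illustrated with a small sketch. It is not the Lingo tool and not the author's algorithm: the toy dictionary `WORD_CLASSES`, the helper names `tag` and `multiword_candidates`, and the rule "longest run of dictionary words that ends in a noun" are all assumptions made here for illustration only.

```python
import re

# Hypothetical toy dictionary mapping words to word classes.
# It stands in for real word-class dictionaries and is an assumption.
WORD_CLASSES = {
    "rigidity": "noun", "properties": "noun", "manifold": "noun",
    "manifolds": "noun", "characterization": "noun", "flag": "noun",
    "curvature": "noun", "result": "noun",
    "local": "adjective", "equivalent": "adjective",
    "symmetrical": "adjective", "symmetric": "adjective",
    "nonzero": "adjective",
    "locally": "adverb",
    "finsler": "name",
}


def tag(text):
    """Split text into word and punctuation tokens and look up each word class."""
    tokens = re.findall(r"[A-Za-z]+|[^\sA-Za-z]", text)
    return [(token, WORD_CLASSES.get(token.lower())) for token in tokens]


def multiword_candidates(text):
    """Return the longest runs of dictionary words that end in a noun.

    A run is broken by any token that is not in the dictionary
    (function words, punctuation, unknown words). Runs with fewer
    than two words are discarded.
    """
    phrases, run = [], []
    # The trailing sentinel flushes a run that reaches the end of the text.
    for token, word_class in tag(text) + [("", None)]:
        if word_class is not None:
            run.append((token, word_class))
            continue
        # Drop trailing modifiers so that the phrase ends in a noun.
        while run and run[-1][1] != "noun":
            run.pop()
        if len(run) >= 2:
            phrases.append(" ".join(word for word, _ in run))
        run = []
    return phrases


sentence = ("We study some rigidity properties for "
            "locally symmetrical Finsler manifolds.")
print(multiword_candidates(sentence))
```

With this toy dictionary, words outside the dictionary act as phrase boundaries, so a run such as "locally symmetric" that contains no noun yields no candidate. A realistic setup would need far larger dictionaries and a way to handle unknown words.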

Similar resources

Semantic Lexicon Induction from Twitter with Pattern Relatedness and Flexible Term Length

With the rise of social media, learning from informal text has become increasingly important. We present a novel semantic lexicon induction approach that is able to learn new vocabulary from social media. Our method is robust to the idiosyncrasies of informal and open-domain text corpora. Unlike previous work, it does not impose restrictions on the lexical features of candidate terms – e.g. by ...

Johan Segura and Violaine Prince Using Alignment to detect associated multiword expressions in bilingual corpora

Translating multiword expressions from one language to another requires recognizing them as such. Bilingual multiword expressions are an issue when they are not the exact word-to-word translation of each other. The following examples are provided for a French-English translation task: (1) Phrasal verbs such as « to call in on » becoming « rendre visite », (2) « sorry to hear that », that a human tra...

Extracting Transfer Rules for Multiword Expressions from Parallel Corpora

This paper presents a procedure for extracting transfer rules for multiword expressions from parallel corpora for use in a rule based Japanese-English MT system. We show that adding the multi-word rules improves translation quality and sketch ideas for learning more such rules.

Yet Another Ranking Function for Automatic Multiword Term Extraction

Term extraction is an essential task in domain knowledge acquisition. We propose two new measures to extract multiword terms from a domain-specific text. The first measure is both linguistic and statistical based. The second measure is graph-based, allowing assessment of the importance of a multiword term of a domain. Existing measures often solve some problems related (but not completely) to t...

Identifying well-formed biomedical phrases in MEDLINE® text

In the modern world people frequently interact with retrieval systems to satisfy their information needs. Humanly understandable well-formed phrases represent a crucial interface between humans and the web, and the ability to index and search with such phrases is beneficial for human-web interactions. In this paper we consider the problem of identifying humanly understandable, well formed, and ...

Journal title:
  • CoRR

Volume abs/1210.0852

Publication date 2012